###Homework for Lab 1: DUE Friday, October 5th
The 1000 Genomes Project, which ran between 2008 and 2015, is as close as it comes to a “catalogue of human variation.” The output of this initiative is a database of whole genome sequences from 26 distinct populations from around the world, all aligned to the same human reference sequence. This data is free to use, and is an excellent resource for researchers who want to study genetic variation in a gene across populations, but cannot afford to collect their own samples.
While it was active, The 1000 Genomes Project published their data in several phases; by the final phase (Phase 3), they had gathered samples from 2,504 individuals from the 26 targeted populations. In 2015, the International Genome Sample Resource (IGSR) was established to “ensure the future usability and accessibility of the 1000 Genomes data.” In keeping with this goal, the IGSR has: re-mapped the Phase 3 data to the latest two human reference sequences, GRCh37 and GRCh38, incorporated externally generated, published genomic data (such as RNA-seq data) into their own dataset, and begun adding previously unsampled populations the database.
Below is a map of the current populations represented in the 1000
Genomes dataset, as well as a reference list of the abbreviations used
to identify these populations.
One last thing to note is that each of these populations falls under
a “super population” which denotes the general area of the world each
population is from. Many times, you will see information split up by
these super populations instead of by each individual population. These
super populations are as follows:
For the in-class labs this semester, each of you will be assigned a focal sub-population for our investigations into UCP1. Unless told otherwise, this is the population you’ll be tracking variation in for the remainder of our time investigating UCP1. You can find your assigned population here!
For more information about 1000 Genomes and IGSR, visit http://www.internationalgenome.org/home.
Bioinformatics is the “science of developing methods and
software tools for collecting and understanding biological data.” It’s
become huge academic and professional field in a relatively short time
as big datasets proliferate in biology, thanks to rapid developments in
sequencing technology and the advances in the various ‘-omics’
fields.
BU has an interdisciplinary Master’s Program in
Bioinformatics, a Bioinformatics Research and
Interdisciplinary Training Experience (BRITE REU) for
undergraduate students, as well as a Bioinformatics Hub (the
BUBHUB) meant to support faculty and students conducting
research in bioinformatics. These might be good resources if you decide
you like this kind of work and want to pursue it further.
Starting this semester, there’s even a student-led
Biology/Bioinformatics Peer Coding Hour here at BU,
where undergraduate and graduate students help each other with
bioinformatics and statistical coding issues. The Peer Coding Hour will
have their first meet-and-greet (with free food!) on October 2nd
from 4-5pm in BRB 113. I recommend everyone join!
The 1000 Genomes Project, or even digitally recording the information
DNA gives us, would not have been possible without this field. To
understand the files that we will be working with (such as VCF files,
which we will discuss later), it is beneficial to know how raw data is
transformed in to digital information. In order to explain this process,
I have included a simple flowchart that I will walk through.
The first step in this flowchart is the DNA sequencing itself. There
are several kinds of sequencing, but we know from the 1000 Genomes paper
we read that they used what is called an Illumina platform. Illumina
uses a specific method of next-generation sequencing
(NGS on the diagram). NGS is is a faster, more efficient, and more
in-depth process of sequencing that is based on shearing the genome into
small pieces and then reconstructing them en masse and in parallel
(multiple times at once) using various proprietary technologies before
mapping those pieces to a reference genome (typically
the first, highest quality, or most completely sequenced individual
genome of a species; this does not mean this is the
representative, average, or most common version of the genome!). The
proprietary Illumina platform was invented by Illumina, and uses a unique method
of sequencing that makes it among the most efficient, affordable, and
accurate ways of sequencing that we have today. BU has it’s own Illumina
sequencing facility on campus. Illumina sequencing in itself is an
incredibly complex process that we won’t talk about in detail here, but
a good video that explains the process can be found here.
DNA sequence reads don’t come out of the machine nicely put
together and cleaned, however. There are a few more steps to turn them
in to nice, neat files. As shown in the diagram, the output of a
sequencing machine is called a Fastq file. A Fastq file
consists of a raw nucleotide sequence that is not yet aligned to a
reference genome, and accompanying quality scores,
which are scores that tell us how reliable the sequencing read for each
base is. You can work with these files, but without aligning them to a
reference genome we won’t be able to get as much from them as we want.
That’s where the next step in the diagram comes in…
Alignment is the process of taking a chunk of DNA
sequence and using a statistical algorithm to compare that chunk to a
reference genome to figure out what portion of the genome that chunk
represents. This is done with all the small sequence chunks that come
from the initial Fastq file until you have a fully aligned genome. Once
you have aligned your Fastq sequence to a reference sequence, you have a
BAM file. A BAM file therefore not only contains an
entire genome’s worth of genetic code, but also gives information about
where any particular piece of code falls within the genome. These files
are good to work with if you need an entire genome’s worth of
information, or detailed information about every nucleotide in a region.
The final step in the flowchart is the VCF
file, which is what we will be working with in our class. VCF files are
the result of picking out just the variant nucleotide positions
(in other words, loci where individual sequences differ from the
reference) from a BAM file. Below, we will look at the VCF file type
more in-depth, as we will be using VCF files in this class.
In our labs, we will be using VCF files to look at our candidate
gene, UCP1. The VCF file format is a computer file format in
which variant genetic information can be stored, as we have seen above.
VCF files in particular are a way of formatting SNP-only information
without having to deal with the rest of the sequence that the SNPs come
from. Other file types, such as BAM files, have their own uses but for
the purposes of our study (and most population genetics studies) they
simply contain way more information than we need: a single BAM file
containing an entire genome can be almost a terabyte (1000 gigabytes) in
size!
VCF files are a text-file format which can be opened with a plain
text editor on your computer, and can be analyzed using various
softwares. Below I have included an example screenshot of what a VCF
file looks like when opened in a plain text editor. This example
compares what a simple representation of the sequence itself aligned to
the reference (‘Alignment’) looks like in VCF format (‘VCF
representation’):
As you can see from parts (b-g) of the figure, there is different
notation that can be used depending on what type of SNP or variant
position is being recorded. If you’re interested in more complex
bioinformatic analyses with data like this, there’s more information
about VCF files here.
Links to, and information for, all of the genome browsers that
feature 1000 Genomes data is found here.
While the 1000 Genomes project was still active, it had its
very own “early access” genome browsers that allowed researchers to get
detailed information about specific genes of interest. These browsers,
still available today, contain open-access data that was updated with
each phase of the project. These browsers are now outdated, however, and
the most up-to-date genomic alignments for the 1000 Genomes data are
generated by and stored in Ensembl. Ensembl is a
genome database that is maintained by the European Bioinformatics Institute, and
houses genomic data for many different species. Ensembl also
has several versions of each dataset, which are updated as new alignment
information becomes available.
For this class, we will be using the most up-to-date version of the
Ensembl human geneome browser, the GRCh38.p12
browser, to look at our gene of interest, UCP1.
Go to the website: http://useast.ensembl.org/Homo_sapiens/Info/Index
Find the search bar in the top left-hand corner and type in
“UCP1.” Make sure the “category” drop-down menu is set to “Search all
categories.” Click “Go.”
Congratulations, you’ve found your gene of interest in
Ensembl! Now, let’s visually explore the genomic information
available for UCP1…
If you click on the “Go to region in detail” option directly above
this image, you will get a more detailed version of the coding region.
If you’re interested, you can do this on your own time; we will not be
using this more detailed view for the purposes of this class. What we
will look at, however, is an image that shows all of the variants in
UCP1.
Oh no: JUST IN THE LAST MONTH, Ensembl decided to
retire this view of the human genome, as known variants have
proliferated to the point where there’s so much information in
this view that it’s become more confusing than helpful. I’ll show you
what it used to look like, just to give you an idea of the information
it contains (note, you can still get this view for non-human
sequences).
In species where this still works, you will find an
interactive image that looks like this:
You can click on each little variant box as well, which will give you
information about each SNP, such as its rs ID number, its location, and
what kind of mutation it is:
While it’s a shame that this image view is no longer available, the
far more helpful data format - the variant table - is
still here!
You will get a table that looks like this:
As you can see under the “Conseq. Type” column, all of the variants
at the top of the table are downstream variants (e.g., variants that
take place past the stop codon of the UCP1 protein coding
region).
Let’s do a little exercise, shall we? Above the
table, there are some filtering options. Click on the “Consequences”
filtering option and hit “Turn All Off”
Now, turn back on all of the mutations that lie within the coding
region of the gene that will also cause a change in
genotype. After you’re done choosing, hit “Apply Changes.”
Now that we have a shortlist of SNPs that will actually come in
handy for understanding variation in the protein output of
UCP1, let’s learn how to get useful information about a
specific SNP that we will use in later labs.
The first thing we’ll look is the Global Minor Allele Frequency
(MAF). MAF is simply the frequency at which the second most common
variant allele from the reference genome (i.e. the minor
allele) occurs in a population. We will be able to get population
breakdowns of the MAF, but first we’ll look at the Global MAF.
Under “Explore This Variant” there is several different tabs that will tell us different things about the SNP. In this class, we will be using two of these features: “Population Genetics” and “Linkage Disequilibrium”. Right now, we will do a quick tour of these tabs so you can just see how they work.
First, click on the “Population Genetics” icon. You will get this
page:
As you can see in these pie charts that there appear to be differences in the allele frequencies for this SNP in each population and sub-population.
Scroll down to the table that gives you the allele frequency
breakdown for all the 1000 Genomes populations. Here is an example
section of the 1000 Genomes data table:
As an exercise, find your population in this table. Make
sure you can find the allele and genotype counts. This will come in
handy when we do our Hardy-Weinberg
Lab.
The final page we’ll explore is the “Linkage Disequilibrium” page.
Find your population.
First, click on the right-most “View Plot” link for your population. The icon is a reddish triangle. A few things will come up on this page.
The first thing you should see on the page is an image of
Chromosome 4, with a red line marking the locus of our SNP of interest:
Knowing what we know about recombination and the structure of the genome, think about how the location of this SNP on the chromosome will affect the likelihood of Linkage Disequilibrium. Is it more or less likely that this SNP will be out of Linkage Disequilibrium?
This plot represents Linkage Disequilibrium blocks. When we talk about Linkage Disequilibrium, we will learn how to read one of these blocks. For now, it’s enough to know what it looks like.
Now, navigate back to the table on the previous page. For your
population, click on the “Show” link under the “Variants in High LD”
column. If your population has SNPs in high Linkage Disequilibrium, you
will get a table that looks like this:
If your population doesn’t have SNPs in high Linkage
Disequilibrium, think about what that tells us about the importance of
UCP1 in your population.
Now that we have explored
some important features of the Ensembl website, we can learn
how to download some more focused datasets of our own that we’ll use
here in our Labs!
Usually when you are working with genomic information, you are given
a whole chromosome or even a whole genome’s worth of information in
either a BAM file or a VCF file. If you only need to look at one small
part of the genome, it can be very annoying to work with a lot of extra
data. The Data Slicer in Ensembl is a convenient way
to get only the amount of data that you want without using a separate
program to cut it out yourself. We will use this tool to get the data
for our analysis of UCP1. We will be taking two slices
of data today: one that contains all of the SNPs in UCP1 (we
may have already done this in our Pre-Lab exercises), and one that
contains only about 1/4 of the SNPs in the gene. We will be using the
larger slice in most of our analyses, but we will need the smaller slice
for Lab 3.
The link to the Data Slicer
is available here: https://useast.ensembl.org/Homo_sapiens/Tools/DataSlicer.
IMPORTANT NOTE: As many of you may have
already noticed, sometimes Data Slicer sucks and forgets how to
link to the Phase 3 dataset appropriately. This makes things a bit more
complicated. As such, I’ve appended two methods, below, to getting the
data we need. The firt tab - Data Slicer WINS! - contains
instructions for when Data Slicer is actually working. The
second tab - Data Slicer SUCKS! - is for when Data
Slicer fails us. The second method isn’t harder, it just involves
post-download filtering of our file, which might take a wee smidge
longer.
Try the Data Slicer WINS! method first. If that fails, it’s
time to move on the the Data Slicer SUCKS! method.
Now, on to the tutorial!
*The file format should be set for VCF. If it’s not, click the drop-down menu and select VCF.
In the “region lookup” bar, copy and paste in the location 4:140559434-140568805. These are the GRCh38.p12 version alignment coordinates for the gene UCP1. This is the larger of the two chunks we will be taking.
In in the “Choose data collections…” dropdown list, make sure “Phase 3” is selected. This will ensure you get data from the last phase of the 1000 Genomes project.
In the “filters” category, select “By populations”. This will give you a dropdown menu of all of the 1000 Genomes populations. Select the population that you were assigned, so that you only get the data for that population.
The filled-in interface should look like this:
If this has worked, and your window looks more or less like the image
above (the entries above might not match yours), then Data
Slicer is winning and you can continue!
If this hasn’t worked and Data Slicer is giving you some kind of error message asking for some URL or other… well… Data Slicer sucks and it’s time to get resourceful, so click on that tab and check out what to do next…
Now, some students have noticed that this second process literally takes forever… in that case, if all else fails… we’ll use Tabix!
#####Data Slicer WINS!
Hit the “run” button at the bottom of the page.
At the top of the page, hit “New Job” and repeat this process, but this time with the coordinates 4:140560308-140564106. Let’s name this one something slightly different so we don’t get confused. These are the coordinates for the smaller slice that we will be taking, so I’ll name mine “YRI_UCP1_small”
When you have clicked “Run” for both jobs, you will see this
table pop up, which will eventually tell you when your job has been
processed. Click “View Results” to look at your results.
We check our file to see if the body is there because sometimes the
server will malfunction and give you only the head of the VCF file. If
that happens, repeat the Data Slicer process. Check both files
in this same way.
Once you’ve checked your files to make sure everything is there, click on “Download results file”, which should save these files in filename.vcf.gz format in the ‘Downloads’ folder in your anth333 workspace on the SCC.
Once in your workspace, change your filenames to make things easier. Rename the full UCP1 file with the acronym for your population. Name the smaller file with the acronym and then “small.” For example, I downloaded data from the YRI population, so I named the larger file “YRI.vcf.gz” and will name the smaller file “YRIsmall.vcf.gz”. Remember, to rename a file in the SCC workspace, you use the mv command (e.g., ‘mv oldname.vcf.gz YRI.vcf.gz’)
#####Data Slicer SUCKS!
Gah… this bug in Data Slicer is THE WORST. But it’s also not the end of the world. We can still get what we need from Ensembl, it’ll just take a bit longer. The bonus is that we’ll learn how to use a couple extra bioinformatics tools that are actually preferred over Data Slicer by folks who work frequently with human genomic data.
Retrace your steps to get to the Data Slicer and re-enter all the naming and coordinate information you had before.
In in the “Choose data collections…” dropdown list, where it says “Phase 3”, click and choose “Provide file URLs”.
In the blank box that opens up next to “Genotype file URL”, copy
the following URL and paste it into the box:
Hit the “run” button at the bottom of the page.
At the top of the page, hit “New Job” and repeat this process, but this time with the coordinates 4:140560308-140564106. Let’s name this one something slightly different so we don’t get confused. These are the coordinates for the smaller slice that we will be taking, so I’ll name mine “YRI_UCP1_small”
When you have clicked “Run” for both jobs, you will see this
table pop up, which will eventually tell you when your job has been
processed. Click “View Results” to look at your results.
We check our file to see if the body is there because sometimes the
server will malfunction and give you only the head of the VCF file. If
that happens, repeat the Data Slicer process. Check both files
in this same way.
Once you’ve checked your files to make sure everything is there, click on “Download results file”, which should save these files in filename.vcf.gz format in the ‘Downloads’ folder in your anth333 workspace on the SCC.
Once in your workspace, change your filenames to make things easier. Rename the full UCP1 file with the acronym for your population. Name the smaller file with the acronym and then “small.” For example, I downloaded data from the YRI population, so I named the larger file “YRI.vcf.gz” and will name the smaller file “YRIsmall.vcf.gz”. Remember, to rename a file in the SCC workspace, you use the mv command (e.g., ‘mv oldname.vcf.gz YRI.vcf.gz’)
#####It’s TABIX time…
Ok, so nothing has worked… we’re tired of working with Data
Slicer!
There’s a solution: tabix!
tabix is a module that you can load into your SCC workspace, like any other module.
[caschmit@scc1 ~]$
module load tabix
tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chr4_GRCh38.genotypes.20170504.vcf.gz 4:140559434-140568805 > YRI_all.vcf
tabix -h [data URL] [region lookup] > [filename].vcf
This has saved the UCP1 region as a VCF formatted file into your working directory!
If you want to take a look at the file, and see that it’s all there, you can use the ‘less’ command and scroll with the ‘down arrow’ key to see the file itself. The screenshot below is of the first look, before you start scrolling. Notice there are A LOT of individuals (over 2,500), with names starting with ‘HG’ or ‘NA’. To leave the scrolling and go back to the prompt, press ‘q’:
less YRI_all.vcf
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel
less integrated_call_samples_v3.20130502.ALL.panel
grep YRI integrated_call_samples_v3.20130502.ALL.panel | cut -f1 > YRI.samples.list
less YRI.samples.list
module load vcftools
vcftools --vcf YRI_all.vcf --keep YRI.samples.list --recode --out YRI
mv YRI.recode.vcf YRI.vcf
less YRI.vcf
tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chr4_GRCh38.genotypes.20170504.vcf.gz 4:140560308-140564106 > YRIsmall_all.vcf
vcftools --vcf YRIsmall_all.vcf --keep YRI.samples.list --recode --out YRIsmall
mv YRIsmall.recode.vcf YRIsmall.vcf